Experimental Assessment of a Threshold Selection Algorithm for Tuning Classifiers in the Field of Hierarchical Text Categorization
Abstract
Text categorization is the task of assigning predefined categories to text documents. It can provide conceptual views of document collections and has many important real-world applications. Most recent research on text categorization has focused on mapping text documents to a set of categories among which structural relationships hold. Without loss of generality, assume that a classifier entrusted with recognizing documents of a given category outputs a degree of membership, usually a value in [0,1]. The behavior of any such classifier typically depends on an acceptance threshold, which turns the degree of membership into a dichotomous decision. Finding the best acceptance thresholds for a set of classifiers linked by taxonomic relationships is, in principle, a difficult problem. Hence, any proposal for finding suboptimal solutions may be of great importance, especially in hierarchical text categorization. In this paper, we experimentally assess a greedy threshold selection algorithm that searches for a suboptimal combination of thresholds in a hierarchical text categorization setting. Because the algorithm has quadratic complexity, good suboptimal solutions can be found even for large taxonomies. Experiments performed on Reuters data collections show that the proposed approach finds suboptimal solutions at small computational cost.
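The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a single root-to-leaf path of classifiers, a fixed grid of candidate thresholds, F1 as the quality measure, and a bottom-up greedy pass that tunes one threshold at a time while holding the others fixed. All function names and parameters are hypothetical.

```python
from typing import List, Sequence


def accept(scores: Sequence[float], thresholds: Sequence[float]) -> bool:
    """Dichotomous decision: a document is accepted along the path only if
    every classifier's degree of membership reaches its threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))


def f1(docs: Sequence[Sequence[float]], labels: Sequence[bool],
       thresholds: Sequence[float]) -> float:
    """F1 of the path-level decisions (assumed quality measure)."""
    tp = fp = fn = 0
    for scores, label in zip(docs, labels):
        pred = accept(scores, thresholds)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_thresholds(docs: Sequence[Sequence[float]], labels: Sequence[bool],
                      candidates: Sequence[float]) -> List[float]:
    """Greedy, bottom-up search: fix one threshold at a time.

    docs[j][i] is the degree of membership in [0, 1] that classifier i
    (0 = root, last = leaf) assigns to document j.
    """
    n = len(docs[0])
    thresholds = [0.0] * n          # start by accepting everything
    for i in range(n - 1, -1, -1):  # leaf first, root last
        best_t = thresholds[i]
        best_score = f1(docs, labels, thresholds)
        for t in candidates:
            trial = thresholds[:i] + [t] + thresholds[i + 1:]
            score = f1(docs, labels, trial)
            if score > best_score:  # keep the first strict improvement
                best_t, best_score = t, score
        thresholds[i] = best_t
    return thresholds
```

The greedy pass evaluates one candidate grid per classifier instead of every combination of thresholds, which is why it yields a suboptimal but inexpensive solution; an exhaustive search would grow exponentially with the number of classifiers.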
Similar resources
A Comparative Experimental Assessment of a Threshold Selection Algorithm in Hierarchical Text Categorization
Most of the research on text categorization has focused on mapping text documents to a set of categories among which structural relationships hold, i.e., on hierarchical text categorization. For solutions of a hierarchical problem that make use of an ensemble of classifiers, the behavior of each classifier typically depends on an acceptance threshold, which turns a degree of membership into a d...
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
Hierarchical text categorization using fuzzy relational thesaurus
Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We present a new approach to text categorization by means of a Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category system that stores and maintains an adaptive local dictionary for each category. The goal of our approach is twofold: to develop a reliable t...
KAN and RinSCut: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization
Two important research areas in statistical approaches for automated text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization systems. After researching common techniques in both areas, we describe a lazy linear classifier known as the keyword a...
A Real-Time Electroencephalography Classification in Emotion Assessment Based on Synthetic Statistical-Frequency Feature Extraction and Feature Selection
Purpose: To assess three main emotions (happy, sad and calm) by various classifiers, using appropriate feature extraction and feature selection. Materials and Methods: In this study a combination of Power Spectral Density and a series of statistical features are proposed as statistical-frequency features. Next, a feature selection method from pattern recognition (PR) Tools is presented to e...